Abstract

We  relax  the  causality  assumption  in  formulating  recurrent  neural  networks,  so  that  the  hidden  states  of 
the  network  are  all  coupled  together.  This  goes  beyond  bidirectional  RNN,  which  consists  of  two  explicit 
recurrent  networks  concatenated  together.  The  motivation  behind  doing  this  is  to  improve  performance 
on  long-range  dependencies,  and  to  improve  stability  (solution  drift)  in  NLP  tasks.  We  choose  an  implicit 
neural network architecture, show that it can be computed reasonably efficiently, and demonstrate a proof of concept on the task of part-of-speech tagging.

1  Introduction 

Feedforward  neural  networks  were  designed  to  approximate  and  interpolate  functions.  Recurrent  networks  were 
developed  to  predict  sequences.  These  recurrent  networks  can  be  ‘unwrapped’,  and  thought  of  as  a  very  deep 
feedforward  network,  with  each  layer  sharing  the  same  set  of  weights.  Computation  proceeds  one  step  at  a  time, 
like the trajectory of an ordinary differential equation when solving an initial value problem. However, in certain
applications in natural language processing, especially those with long-distance effects, and where grammar
matters,  sequence  prediction  may  be  better  thought  of  as  a  boundary  value  problem.  In  this  work,  we  propose 
challenging  the  traditional  left-to-right  causality  of  neural  networks,  and  demonstrate  the  feasibility  and  potential 
power  of  implicit  methods. 

2  Comparison  with  previous  work 

Long-range  dependencies  have  been  an  issue  as  long  as  there  have  been  NLP  tasks,  and  there  are  many  effective 
approaches to dealing with them. In the context of HMMs, there are the “Forward-Backward” models. In information
extraction, there are non-local sequence models that use Gibbs sampling (Finkel et al., 2005). More recently,
there is the bidirectional LSTM (Schuster and Paliwal, 1997), which incorporates past and future hidden states
via two separate recurrent networks. Unlike most of these previous methods, our method cannot be reduced
to a dynamic programming technique. The resulting states for each sequence element are more strongly coupled
to each other, with potential for emergent effects. Solving the equations is more difficult, however, and requires
techniques for solving nonlinear systems of equations.

3  Recurrent  Neural  Networks 

A  typical  recurrent  neural  network  has  the  following  specification: 

•  Input sequence $[x_1, x_2, \ldots, x_s]$

•  Initial state given as an order-$k$ tensor $h_0$

•  Transition function $h' = f(x, h)$

Produce an order-$(k+1)$ tensor:

$$[h_0,\ h_1 = f(x_1, h_0),\ h_2 = f(x_2, h_1),\ \ldots,\ h_s = f(x_s, h_{s-1})]$$
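
The specification above can be sketched in a few lines of NumPy. This is only an illustration: the `tanh_cell` transition, the weight names `W` and `U`, and the dimensions are assumptions chosen for the example, not the transition functions used in the experiments.

```python
import numpy as np

def run_rnn(f, xs, h0):
    """Apply h_t = f(x_t, h_{t-1}) left to right and stack all states."""
    states = [h0]
    for x in xs:
        states.append(f(x, states[-1]))
    # Stacking s + 1 order-k tensors gives a tensor of one order higher.
    return np.stack(states)

rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(4, 3))   # input-to-hidden weights (illustrative)
U = rng.uniform(-0.1, 0.1, size=(4, 4))   # hidden-to-hidden weights (illustrative)

def tanh_cell(x, h):
    # A plain tanh transition, used here only as a stand-in for f.
    return np.tanh(W @ x + U @ h)

xs = rng.normal(size=(5, 3))              # s = 5 inputs of dimension 3
H = run_rnn(tanh_cell, xs, np.zeros(4))
print(H.shape)
```

Each state is available only after the one before it has been computed, which is the causality assumption discussed next.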

Long short-term memory (Hochreiter and Schmidhuber, 1997), gated recurrent units (Cho et al., 2014), and their
variants follow this formula, with different choices for the state transition function. Computation proceeds
left-to-right, with each next state depending on the previously computed hidden state. This is a very reasonable
assumption, and it makes computation straightforward and tractable. However, what would happen without it? Suppose
$h_t = f(x_t, h_{t-1}, h_{t+1})$, or an even wider stencil? We arrive at a system of nonlinear equations. This setup has the
potential for arriving at nonlocal, whole-sequence dependent results. Analogously to the theory of ODEs, it may
also be more ‘stable’, whereby the predicted sequence may drift less from the true meaning, since errors will not
compound with each time step in the same way.

Figure 1: Traditional recurrent neural network structure.


4  Implicit  RNN 

4.1  Architecture  for  this  work 

There  are  many  possibilities  of  how  to  architect  a  neural  network  -  in  fact,  this  is  one  of  its  best  features  -  but  we 
restrict  our  discussion  to  the  one  depicted  in  Figure  2.  In  this  setup,  we  have  the  following  variables: 


  $X$         input data
  $Y$         input labels
  $\theta$    parameters
  $\xi$       transformed input
  $H$         hidden layers

and functions:

  loss function                       $L = \ell(\theta, H, Y)$
  implicit hidden layer definition    $H - F(\theta, \xi, H) = 0$
  input layer transformation          $\xi = g(\theta, X)$


Our implicit definition function, $F$, is made up of local state transitions, and forms a system of nonlinear
equations that we need to solve:

$$h_1 = f(h_0, h_2, \xi_1)$$
$$h_i = f(h_{i-1}, h_{i+1}, \xi_i)$$
$$h_n = f(h_{n-1}, h_{n+1}, \xi_n)$$
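
As a rough sketch of how the global function $F$ can be assembled from the local transition $f$, the NumPy code below shifts the state matrix left and right and applies $f$ row by row. The boundary states `h_left` and `h_right` (standing in for $h_0$ and $h_{n+1}$) are set to zero here as an assumption, and the `tanh` transition with weights `Wp`, `Wn`, `Wx` is purely illustrative.

```python
import numpy as np

def make_F(f, xi, h_left, h_right):
    """Return F such that F(H)[i] = f(H[i-1], H[i+1], xi[i]).

    h_left and h_right are fixed boundary states standing in for
    h_0 and h_{n+1}; zero boundaries are an assumption of this sketch.
    """
    def F(H):
        prev = np.vstack([h_left[None, :], H[:-1]])   # h_{i-1} for each i
        nxt = np.vstack([H[1:], h_right[None, :]])    # h_{i+1} for each i
        return np.stack([f(p, q, x) for p, q, x in zip(prev, nxt, xi)])
    return F

rng = np.random.default_rng(0)
d, n = 3, 5
Wp, Wn, Wx = (rng.uniform(-0.1, 0.1, size=(d, d)) for _ in range(3))

def f(h_prev, h_next, x):
    # Illustrative local transition, not the gated unit used in the experiments.
    return np.tanh(Wp @ h_prev + Wn @ h_next + Wx @ x)

xi = rng.normal(size=(n, d))
F = make_F(f, xi, np.zeros(d), np.zeros(d))
H = rng.normal(size=(n, d))
residual = H - F(H)       # zero exactly when H solves the system
print(residual.shape)
```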


4.2  Computing  the  forward  pass 

To evaluate the network, we must solve the equation $H = F(H)$. We compute the solution via an approximate Newton
solve:

$$H_{n+1} = H_n - (I - \nabla_H F)^{-1} (H_n - F(H_n))$$
$$H_{n+1} \approx H_n - P(\nabla_H F) (H_n - F(H_n))$$

We use the geometric series polynomial $P_n(x) = 1 + x + x^2 + \ldots + x^n$, which converges to $(1 - x)^{-1}$ provided that
$\|\nabla_H F\| < 1$. (Here the polynomial degree $n$ is distinct from the iteration index in $H_n$.) For most of our
experiments we even let $n = 1$. Note that $n = 0$ is fixed point iteration, which has also worked in experiments,
albeit with slower convergence.
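
The iteration can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the toy system $F(H) = \tanh(AH + c)$, the names `solve_implicit` and `jvp`, and the tolerance are all hypothetical, and the Jacobian-vector product is written out by hand rather than obtained from an automatic differentiation library.

```python
import numpy as np

def solve_implicit(F, jvp, H0, degree=1, tol=1e-10, max_iter=200):
    """Iterate H <- H - P_n(J)(H - F(H)), with P_n(x) = 1 + x + ... + x^n.

    jvp(H, v) must return J v, where J is the Jacobian of F at H.
    degree=0 reduces to fixed point iteration H <- F(H).
    """
    H = H0
    for it in range(max_iter):
        r = H - F(H)
        if np.linalg.norm(r) < tol:
            return H, it
        v = r
        for _ in range(degree):       # Horner form: v = r + J v
            v = r + jvp(H, v)
        H = H - v
    return H, max_iter

rng = np.random.default_rng(0)
m = 20
A = rng.uniform(-0.1, 0.1, size=(m, m))   # small weights keep ||J|| < 1
c = rng.normal(size=m)

F = lambda H: np.tanh(A @ H + c)
jvp = lambda H, v: (1.0 - F(H) ** 2) * (A @ v)   # J = diag(1 - F^2) A

results = {}
for degree in (0, 1, 3):
    H, iters = solve_implicit(F, jvp, np.zeros(m), degree=degree)
    results[degree] = (iters, np.linalg.norm(H - F(H)))
    print(degree, *results[degree])
```

The inner loop never forms the Jacobian; it only needs products of the Jacobian with a vector, which is what makes the polynomial approximation cheap relative to an exact Newton step.
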
Figure 2: For the remainder of this paper, we will focus on the following ‘word classification’ architecture.


The assumption that $\|\nabla_H F\| < 1$ probably holds due to the initialization of the network. Weight matrices
for the bidirectional transition function (see Section 4.4) strongly affect the size of the overall gradient, and we
initialize all parameters within the range [-0.1, 0.1]. In experiments, as the magnitudes of the weights grow, the solver
takes more iterations; however, it does not seem to reach a point where the approximation is wholly inadequate.

4.3  Gradients 

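In order to train the model, we perform gradient descent. Taking the gradient of the loss function:

$$\nabla_\theta L = \nabla_\theta \ell + \nabla_H \ell \, \nabla_\theta H$$

so we will need to know the gradient of the hidden units with respect to the parameters, which we can find via
the implicit definition:

$$\nabla_\theta H = \nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi + \nabla_H F \, \nabla_\theta H$$
$$(I - \nabla_H F) \, \nabla_\theta H = \nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi$$

The entire gradient is then

$$\nabla_\theta L = \nabla_\theta \ell + \nabla_H \ell \, (I - \nabla_H F)^{-1} (\nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi)$$

Once again, the inverse of $I - \nabla_H F$ appears, and we can approximate it via the polynomial $P(x)$ above.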
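
The gradient formula can be checked numerically on a small toy problem. The sketch below is an assumption-laden illustration, not the experimental model: it takes $F(\theta, \xi, H) = \tanh(AH + \xi + C\theta)$, $\xi = g(\theta, X) = X\theta$, and a squared-error loss with a small penalty on $\theta$; the names `A`, `C`, `lam`, `solve_H` and `implicit_grad` are hypothetical. It uses an exact linear solve in place of the polynomial approximation and compares the result with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 6, 4                                  # hidden size, parameter count
A = rng.uniform(-0.1, 0.1, size=(m, m))      # fixed coupling (not trained here)
C = rng.normal(size=(m, p))                  # direct dependence of F on theta
X = rng.normal(size=(m, p))                  # input data; xi = g(theta, X) = X theta
Y = rng.normal(size=m)
lam = 0.1

def solve_H(theta, iters=200):
    xi = X @ theta
    H = np.zeros(m)
    for _ in range(iters):                   # fixed point iteration H <- F(theta, xi, H)
        H = np.tanh(A @ H + xi + C @ theta)
    return H

def loss(theta):
    H = solve_H(theta)
    return 0.5 * np.sum((H - Y) ** 2) + 0.5 * lam * np.sum(theta ** 2)

def implicit_grad(theta):
    H = solve_H(theta)
    D = np.diag(1.0 - H ** 2)                # tanh' evaluated at the solution H = F
    grad_H_F = D @ A
    grad_theta_F = D @ C
    grad_xi_F, grad_theta_xi = D, X
    rhs = grad_theta_F + grad_xi_F @ grad_theta_xi
    grad_theta_H = np.linalg.solve(np.eye(m) - grad_H_F, rhs)
    # grad_theta loss + grad_H loss * grad_theta H
    return lam * theta + (H - Y) @ grad_theta_H

theta = rng.normal(size=p)
eps = 1e-6
fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(p)])
print(np.max(np.abs(fd - implicit_grad(theta))))
```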

4.3.1  Computing  the  gradient 

The multiplications are best performed left-to-right; this way, they are vector-matrix multiplies, and not
matrix-matrix multiplications. The terms in the gradient calculation have the dimensions below:

Currently, training examples are fed into the system one at a time, and the costs and gradients are accumulated
into a batch before updating. We make extensive use of Theano’s (Bergstra et al., 2011) tensor library, especially
the convenient Rop and Lop operators, which right- or left-multiply the Jacobian of a term with a vector.
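
The left-to-right evaluation can be sketched as below. The matrices `J` and `R` and the vector `u` are random stand-ins (assumptions for illustration) for $\nabla_H F$, for $\nabla_\theta F + \nabla_\xi F \nabla_\theta \xi$, and for $\nabla_H \ell$ respectively; in a real model the product `w @ J` would come from a left-multiplication operator such as Lop rather than from an explicit matrix. The helper name `left_multiply_inverse` is hypothetical.

```python
import numpy as np

def left_multiply_inverse(u, vjp, degree):
    """Approximate u (I - J)^{-1} by u P_n(J), using only vector-Jacobian products.

    vjp(w) must return the row vector w J.
    """
    w = u
    for _ in range(degree):      # Horner form: w = u + w J
        w = u + vjp(w)
    return w

rng = np.random.default_rng(0)
m, p = 50, 8
J = rng.uniform(-0.05, 0.05, size=(m, m))    # stand-in for grad_H F, with ||J|| < 1
R = rng.normal(size=(m, p))                  # stand-in for the right-hand factor
u = rng.normal(size=m)                       # stand-in for grad_H loss

exact = u @ np.linalg.solve(np.eye(m) - J, R)
errors = []
for degree in (1, 3, 10):
    w = left_multiply_inverse(u, lambda w: w @ J, degree)
    approx = w @ R                           # still a vector-matrix product
    errors.append(np.linalg.norm(approx - exact))
    print(degree, errors[-1])
```

Every step keeps a single row vector of length `m`, so the cost per term is one vector-Jacobian product instead of a product of two `m`-by-`m` matrices.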


4.4  Transition  Functions 


The  simplest  RNN  transition  function  would  be  the  traditional  RNN  layer,  with  an  extra  term  for  the  next  hidden 
state: 


$$h_i = \sigma(W_p h_{i-1} + W_n h_{i+1} + b_i)$$

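Preliminary experiments with this transition function did not show good convergence properties. Alternatively, we
modify the Gated Recurrent Unit transition function to be bidirectional. Recall the original GRU equations:

$$h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t h_{t-1}) + b)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

The bidirectional variant uses an interpolation of previous and next hidden states, via a switch:

$$h_{t,c} = s \, h_{t-1} + (1 - s) \, h_{t+1}$$
$$s = \frac{s_l}{s_l + s_r}$$
$$s_l = \sigma(W_{xsl} x_t + W_{sl} h_{t-1} + b_{sl})$$
$$s_r = \sigma(W_{xsr} x_t + W_{sr} h_{t+1} + b_{sr})$$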
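
A possible reading of the switch equations is sketched below in NumPy. How the interpolated state enters the GRU update is not spelled out in the text above, so the sketch assumes that $h_{t,c}$ simply takes the place of $h_{t-1}$ in the standard GRU equations; that substitution, the function names, and the parameter dictionary keys are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(d_in, d_h, rng, scale=0.1):
    """Uniform initialization in [-scale, scale]; the key names are illustrative."""
    shapes = {
        "W": (d_h, d_in), "U": (d_h, d_h), "b": (d_h,),
        "Wz": (d_h, d_in), "Uz": (d_h, d_h), "bz": (d_h,),
        "Wr": (d_h, d_in), "Ur": (d_h, d_h), "br": (d_h,),
        "Wxsl": (d_h, d_in), "Wsl": (d_h, d_h), "bsl": (d_h,),
        "Wxsr": (d_h, d_in), "Wsr": (d_h, d_h), "bsr": (d_h,),
    }
    return {k: rng.uniform(-scale, scale, size=s) for k, s in shapes.items()}

def bidirectional_gru_step(p, x, h_prev, h_next):
    # Switch: elementwise weights for mixing the left and right neighbours.
    s_l = sigmoid(p["Wxsl"] @ x + p["Wsl"] @ h_prev + p["bsl"])
    s_r = sigmoid(p["Wxsr"] @ x + p["Wsr"] @ h_next + p["bsr"])
    s = s_l / (s_l + s_r)
    h_c = s * h_prev + (1.0 - s) * h_next
    # Assumption: h_c takes the place of h_{t-1} in the standard GRU update.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_c + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_c + p["br"])
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_c) + p["b"])
    return (1.0 - z) * h_c + z * h_tilde, s

rng = np.random.default_rng(0)
params = init_params(d_in=5, d_h=4, rng=rng)
x, h_prev, h_next = rng.normal(size=5), rng.normal(size=4), rng.normal(size=4)
h, s = bidirectional_gru_step(params, x, h_prev, h_next)
print(h.shape, s.min(), s.max())
```

Because `s_l` and `s_r` are both sigmoid outputs, each component of `s` lies strictly between 0 and 1, so `h_c` is a convex combination of the two neighbouring states.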


5 Experiments: Part-of-speech tagging

Part-of-speech tagging is a natural choice for the ‘word classifier’ architecture described above. Since part-of-speech
tagging is a mature field, our aim is not to build the best tagger in the world, but to show the new architecture in
action. To train a part-of-speech tagger, we simply let L be a softmax layer transforming each hidden unit output into a
part-of-speech tag. Initially, $\xi$ consisted only of word vectors for 39,000 case-sensitive vocabulary words. Next, we
lowercased the vocabulary words, but added a single feature indicating whether case appeared in the data. Third,
we added six additional ‘word vector’ components to encode the top-2000 most common prefixes and suffixes
of words, for affix lengths 2 to 4. Finally, we added other (binary) features to indicate numbers, symbols,
punctuation, and richer case data, as used by Huang et al. (2015).

We trained the POS tagger on the Wall Street Journal corpus, blocks 0-18, validated on 19-21, and tested on
22-24. We also tested it on the TED treebank (Neubig et al.), and compared it to the results of the off-the-shelf
Stanford Part-of-Speech tagger. The results are indicated in Table 1. We were able to achieve comparable results,
and as Manning notes, performance gains past that point are quite difficult, due to errors and inconsistencies in the
dataset, ambiguity, and very difficult linguistics, sometimes with dependencies across sentences (Manning, 2011).

Training was done using stochastic gradient descent, with an initial learning rate of 0.5. The costs and gradients
were computed for individual sentences on a GPU, and then aggregated together in a batch of size 50 before updating
the parameters of the model. Word vectors were of dimension 200; prefix and suffix vectors were of dimension
20. Hidden unit size was equal to feature input size, so in this case, 321. Training this way takes about 20
seconds per batch. Without the GPU, it runs at approximately the same speed, but it uses all 12 CPU cores of
the machine running the experiment. Therefore, on a multi-GPU machine it makes more sense to use the GPU and
run several experiments in parallel.

We also visualized some of the outputs of the “switch” variables for various sentences. The switch is made up of
many features, so it does not necessarily always correspond to human judgment, but by taking the average, one can
get a sense of the flow of information. In Figure 3, we see a visualization of the switch on a very simple sentence,
and in Figure 4 we see it in action over a more complicated sentence. Interestingly, phrasal structures emerge.

6 Experiments: Sentiment analysis

We used the Stanford IMDB movie review corpus (Maas et al., 2011), training and validating on 25,000 movie
reviews (22.5K/2.5K split), and testing on a separate 25,000. The corpora are evenly split into half positive and half
negative reviews. We set it up as a binary classification problem (like/dislike), and use the architecture inspired by
Theano’s deep learning for sentiment analysis tutorial¹: after the encoding of the hidden states via either a recurrent
net (LSTM or similar) or an implicit network, the hidden states are averaged across words
before feeding into a final binary logistic regression classifier. The performance of our model and a few baselines
is noted in Table 3. As in the Theano example, the model was trained with Adadelta (Zeiler, 2012) because this
model seems to train very poorly with stochastic gradient descent. Additionally, we used a batch size of 1, which,
surprisingly, yielded faster convergence on this problem. Because of the extremely long sentences in each
training example, we used the third-order approximation to the matrix inverse described in Section 4.2.

¹ http://deeplearning.net/tutorial/lstm.html

Figure 3: Visualization of the switch variable. Values above 0 indicate a right-to-left flow of information, while
values below 0 indicate left-to-right. Note that ‘Tokyo’ is used to modify ‘market’, instead of being a noun, and
thus needs information from ‘market’ to make the correct determination.

Figure 4: A more common and complicated example. Note that the entire clause about ‘Health Care Property Investors
Inc.’ propagates information from the right: without the ‘Inc.’, the model may not realize it is the name of a company.
Also of note are the phrases “offering of 200,000 shares” and “Merrill Lynch Capital Markets.”

Tagger                    WSJ Accuracy
Word vectors only         0.9626
Single case feature       0.9650
Ensemble of above (2)     0.9683
Affix word-vectors        0.9714
Ensemble of above (4)     0.9731
Case+Symbol feats         0.9730
Ensemble of above (4)     0.9736
Stanford POS Tagger       0.9732

Table 1: Tagging performance.


Architecture              WSJ Accuracy (%)
Gated recurrent           96.51
LSTM                      96.53
Bidirectional GRU         97.26
Bidirectional LSTM        97.27
Implicit                  97.30

Table 2: Tagging performance relative to other recurrent architectures.


Architecture              IMDB Accuracy
Gated recurrent           0.8658
LSTM                      0.8784
B-LSTM                    0.8836
Implicit                  0.8810*

Table 3: Sentiment performance relative to other recurrent architectures. *Incomplete result.